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Abstract 



We present a fast algorithm for approximate Canonical Correlation Analysis (CCA). Given 
a pair of tall-and-thin matrices, the proposed algorithm first employs a randomized dimen- 
sionality reduction transform to reduce the size of the input matrices, and then applies any 
standard CCA algorithm to the new pair of matrices. The algorithm computes an approximate 
CCA to the original pair of matrices with provable guarantees, while requiring asymptotically 
less operations than the state-of-the-art exact algorithms. 

1 Introduction 

Canonical Correlation Analysis (CCA), originally due to [16], is an important technique in statis- 
tics, data analysis, and data mining. CCA has been successfully applied in many machine learning 
applications, e.g. dimensionality reduction [23], clustering [7], learning of word embeddings [9], 
sentiment classification [8], discriminant learning [22], and object recognition [18]. In many ways 
CCA is analogous to Principal Component Analysis (PCA), but instead of analyzing a single data- 
set (in matrix form), the goal of CCA is to analyze the relation between a pair of data-sets (each in 
matrix form). From a statistical point of view, PCA extracts the maximum covariance directions 
between elements in a single matrix, whereas CCA finds the direction of maximal correlation be- 
tween a pair of matrices. From a linear algebraic point of view, CCA measures the similarities 
between two subspaces (those spanned by the columns of each of the two matrices analyzed). 
From a geometric point of view, CCA computes the cosine of the principle angles between the two 
subspaces. 

There are different ways to define the canonical correlations of a pair of matrices, and all these 
methods are equivalent [13]. The linear algebraic formulation of [13], which we present shortly, 
serves our algorithmic point of view best. 



Definition 1. Let A € R mXTt and B € R mx ^ , and assume that p = rank(A) > rank(B) = q. The 
canonical correlations u\ (A, B) > <r 2 (A, B) > ■ ■ ■ > a q (A, B) of the matrix pair (A, B) are defined 
recursively by the following formula: 

<jj(A,B)= max a (Ax, By) =: a (Ax*, By;) , 

i = l,...,q 

where 

• Cr(u,v) = |u T v|/(||u|| 2 ||v||2), 

• Ai = {x : Ax / 0, Ax _L {Axi, . . . , Axi_i}}, 

. Bi = {y : By ^ 0, By JL {B yi , . . . , B yi _i}}. 

TTze um'f vectors Axi/||Axi|| 2 , . . . , Ax ? /|| Ax ? || 2 , Byi/||Byi|| 2 , . . . , By g /||Byg|| 2 are called the canon- 
ical or principal vectors 1 . The vectors xi/||Axi|| 2 , . . . , x ? /|| AxJ 2 , yi/||Byi|| 2 , . . . ,y g /||By ? || 2 are 
called canonical weights (or projection vectors). 

1.1 Main Result 

The main contribution of this paper (see Theorem 15) is a fast algorithm to compute an approxi- 
mate CCA. The algorithm computes an additive-error approximation to all the canonical correla- 
tions. It also computes a set of approximate canonical weights with provable guarantees. We show 
that the proposed algorithm is asymptotically faster compared to the standard method of [4]. To 
the best of our knowledge this is the first sub-cubic time algorithm for approximate CCA that has 
provable guarantees. 

The proposed algorithm is based on dimensionality reduction: given a pair of matrices (A, B), 
we transform the pair to a new pair (A, B) that has much fewer rows, and then compute the 
canonical correlations of the new pair exactly, alongside a set of canonical weights, e.g. using the 
Bjorck and Golub algorithm. We prove that with high probability the canonical correlations of 
(A, B) are close to the canonical correlations of (A, B), and that any set of canonical weights of 
(A, B) can be used to construct a set of approximately orthogonal canonical vectors of (A, B). The 
transformation of (A, B) into (A, B) is done in two steps. First, we apply the Randomized Walsh- 
Hadamard Transform (RHT) to both A and B. This is a unitary transformation, so the canonical 
correlations are preserved exactly. On the other hand, we show that with high probability, the 
transformed matrices have their "information" equally spread among all the input rows, so now 
the transformed matrices are amenable to uniform sampling. In the second step, we uniformly 
sample (without replacement) a sufficiently large set of rows and rescale them to form (A, B). The 
combination of RHT and uniform sampling is often called Subsampled Randomized Walsh-Hadamard 
Transform (SRHT) in the literature [25]. Note that other variants of dimensionality reduction [20] 
might be appropriate as well, but for concreteness we focus on the SRHT. 

Our dimensionality reduction scheme is particularly effective when the matrices are tall-and- 
thin, that is they have much more rows than columns. Targeting such matrices is natural: in 
typical CCA applications, columns typically correspond to features or labels and rows correspond 
to samples or training data. By computing the CCA on as many instances as possible (as much 

^ote that the canonical weights and the canonical vectors are not uniquely defined. 
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training data as possible), we get the most reliable estimates of application-relevant quantities. 
However in current algorithms adding instances (rows) is expensive, e.g. in Bjorck and Golub 
algorithm we pay 0(n 2 + I 2 ) for each row. Our algorithm allows practitioners to run CCA on 
huge data sets because we reduce the cost of an extra row, making it not much more expensive 
than 0(n + t). 

1.2 Related Work 

Dimensionality reduction has been the driving force behind many recent algorithms for acceler- 
ating key machine learning and linear algebraic tasks. A representative example is linear regres- 
sion, i.e., solve the least squares problem min x ||Ax — b||2, where b € M. m . If m > n, then one 
can use the SRHT to reduce the dimension of A and b, to form A and b, and then solve the small 
problem min x || Ax - b|| 2- This process will return an approximate solution to the original prob- 

lem [20, 5, 11]. Alternatively, one can observe that A A and A A are spectrally close, so A is an 
effective preconditioner for A [19, 2]. Other problems that can be accelerated using dimensionality 
reduction include: (i) approximate PCA (via low-rank matrix approximation) [14]; (ii) matrix mul- 
tiplication [20]; (iii) K-means clustering [6]; (iv) approximation of matrix coherence and statistical 
leverage [10]; to name only a few. 

Our approach uses similar techniques as the algorithms mentioned above. For example, Lemma 4 
plays a central role in these algorithms as well. However, our analysis requires the use of advanced 
ideas from matrix perturbation theory and it leads to two new technical lemmas that might be of 
independent interest: Lemmas 10 and 11 provide bounds for the singular values of the product 
of two different sampled orthonormal matrices. Previous work only provides bounds for products 
of the same matrix (Lemma 4; see also [20, Corollary 11]) 

Dimensionality reduction techniques for accelerating CCA have been suggested or used in 
the past. One common technique is to simply use less samples by uniformly sampling the rows. 
While this technique might work reasonably well in many instances, it may fail for others unless 
all rows are sampled. In fact, Theorem 13 analyzes uniform sampling, and establishes bounds on 
the required sample size. 

[23] suggest a two-stage approach which involves first solving a least-squares problem, and 
then using the solution to reduce the problem size. However, their technique involves explicitly 
factoring one of the two matrices, which takes cubic time. Therefore, their method is especially 
effective when one of the two matrices has significantly less columns than the other. When the 
two matrices have about the same number of columns, there is no asymptotic performance gain. 
In contrast, our method is sub-cubic in any case. 

Finally, it is worth noting that CCA itself has been used for dimensionality reduction [24, 7, 23]. 
This is not the focus of this paper; we suggest a dimensionality reduction technique to accelerate 
the computation of CCA. 

2 Preliminaries 

We use i : j to denote the set {i, . . . ,j}, and [n] = 1 : n. We use A, B, . . . to denote matrices 
and a, b, ... to denote column vectors. I n is the n x n identity matrix; mxn is the m x n matrix 
of zeros. We denote by TZ(-) the column space of its argument matrix. We denote by [A; B] the 
matrix obtained by concatenating the columns of B next to the columns of A. Given a subset of 
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indices T C [m], the corresponding sampling matrix S is the \T\ x m matrix obtained by discarding 
from I m the rows whose index is not in T. Note that SA is the matrix obtained by keeping only 
the rows in A whose index appears in T. A symmetric matrix A is positive semi-definite (PSD), 
denoted by ■< A, if x T Ax > for every vector x. For any two symmetric matrices X and Y of 
the same size, X ■< Y denotes that Y — X is a PSD matrix. 

We denote the compact (or thin) SVD of a matrix A G l mxn of rank p by A = Ua£a"V a , 
with U A G M mxp , S A G M pxp , and V A G M pxn . The Moore-Penrose pseudo-inverse of A is 
A + = VaS^UI G M nxm . We denote the singular values of A by a ± (A) > a 2 (A) >■■■> a p (A). 

2.1 The Bjorck and Golub Algorithm 

There are quite a few algorithms to compute the canonical correlations [13]. One of the most 
popular methods is due to [4]. It is based on the following observation. 

Theorem 2 ([4]). Assume that the columns of Q G R mxp (m > p) and W G M. mxq (m > q) form an 
orthonormal basis for the range of A and B (respectively). Let Q T W = USV T be its compact SVD. The 
diagonal elements o/S are the canonical correlations of (A, B). The canonical vectors are given by the first 
q columns o/QU (for A) and WV (for B). 

Theorem 2 implies that once we have a pair of matrices Q and W with orthonormal columns 
whose column space spans the same column space of A and B, respectively, then all we need is 
to compute the singular value decomposition of Q T W. Bjorck and Golub suggest the use of QR 
decompositions, but Ua and Ub will serve as well. Both options require O (m (n 2 + I 2 )) time. 

Corollary 3. Frame Definition 1. Let U a Ub = USV T be its compact SVD. Then, for i G [q]: 
o"j(A, B) = Sjj. The canonical weights are given by the columns o/VaS^U (for A) and V B S B 1 V 
(for B). 

2.2 Matrix Coherence and Sampling from an Orthonormal Matrix 

Given a matrix A with m rows, the coherence of A is defined as m(A) = maxj g [ m ] ||eTUA|||) where 
ej is the i-th standard basis (column) vector of W 71 . Note that the coherence of A is a property of the 
column space of A, and does not depend on the actual choice of A. Therefore, if 1Z(A) = TZ(B) 
then (i(A) = //(B). Furthermore, it is easy to verify that if 11(A) C K(B) then //(A) < //(B). 
Finally, we mention that for every matrix A with m rows: 

rank(A)/m < //(A) < 1. 

We focus on tall-and-thin matrices, i.e. matrices with (much) more rows than columns. We 
are interested in dimensionality reduction techniques that (approximately) preserve the singular 
values of the original matrix. The simplest idea to do dimensionality reduction in tall-and-thin 
matrices is uniform sampling of the rows of the matrix. Coherence measures how susceptible the 
matrix is to uniform sampling; the following lemma shows that not too many samples are required 
when the coherence is small. The bound is almost tight [25, Section 3.3]. 

Lemma 4 (Sampling from Orthonormal Matrix, [25] Corollary to Lemma 3.4). Let Q G W nxd have 
orthonormal columns. Let < e < 1 and < 5 < 1. Let r be an integer such that 

6e~ 2 mfi(Q)log(3d/5) <r <m. 
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Let T be a random subset of [m] of cardinality r, drawn from a uniform distribution over such subsets, and 
let S be the \T\ x m sampling matrix corresponding to T rescaled by y/m/r. Then, with probability of at 
least 1 — 5, for i e [d\: 

Vl^e < ai(SQ) < Vl + ~e. 

Proof. Apply Lemma 3.4 from [25] with the following choice of parameters: £ = aMlog(k/5), 
a = 6/e 2 , and 5t rop p = rj = e. Here, i, a, M, k, rj are the parameters of Lemma 3.4 from [25]; also 
Stropp plays the role of 5, an error parameter, of Lemma 3.4 from [25]. e and 6 are from our Lemma. 
■ 

In the above lemma, T is obtained by sampling coordinates from [m] without replacement. 
Similar results can be shown for sampling with replacement, or using Bernoulli variables [17]. 



2.3 Randomized Walsh-Hadamard Transform 



Matrices with high coherence pose a problem for algorithms based on uniform row sampling. 
One way to circumvent this problem is to use a coherence-reducing transformation. One pop- 
ular coherence-reducing transformation is the Randomized Walsh-Hadamard Transform (RHT) 
matrix. We start with the definition of the deterministic Walsh-Hadamard Transform matrix. 

Fix an integer m = 2 h , for h = 1,2,3,.... The (non-normalized) m x m matrix of the Walsh- 
Hadamard Transform (WHT) is defined recursively as, 



H m — 



H 
H 



m/2 
m/2 



H 



m/2 



-H 



m/2 



with H 2 



+1 +1 
+1 -1 



The m x m normalized matrix of the Walsh-Hadamard transform is H = m~ 

The recursive nature of the WHT allows us to compute HX for an m x n matrix X in time 
0(mn log(ra)). However, in our case we are interested in SHX where S is a r-row sampling 
matrix. To compute SHX only 0(mn\og(r)) operations suffice [1, Theorem 2.1]. 

Definition 5 (Randomized Walsh-Hadamard Transform (RHT)). Let m = 2 h for some positive integer 
h. A Randomized Walsh-Hadamard Transform (RHT) is anm x m matrix of the form 

& = HD 

where D is a random diagonal matrix of size m whose entries are independent random signs, and H is a 
normalized Walsh-Hadamard matrix of size m. 

Lemma 6 (RHT bounds Coherence, [25] Lemma 3.3). Let A be an m x n (m > n, m = 2 h for some 
positive integer h) matrix, and let be an RHT. Then, with probability of at least 1-5, 



3 Perturbation Bounds for Matrix Products 

This section states three new technical lemmas which analyze the perturbation of the singular 
values of the product of a pair of matrices after dimensionality reduction. These lemmas are 
essential for our analysis in subsequent sections, but they might be of independent interest as 
well. We first state three well known results. 
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Lemma 7 ([12] Theorem 3.3). Let * G RP x i and * = D L *D R with D L G M pxp and D R G R'** being 
non-singular matrices. Let 7 = max{||DLD£ — I p ||2, ||D r Dr — I 9 ||2}. Then, for all i = l,..., rank(\I>) : 

|(7i (*) - <7i (*) I < 7 ■ <7j (*) • 

Lemma 8 (Weyl's inequality for singular values; [15] Corollary 7.3.8). Let * G R mxn . Then, for 
alii = 1, ... , min(m, n) : |crj (3>) — (Jj (*) | < ||<& — *||2- 

Lemma 9 (Conjugating the PSD ordering; Observation 7.7.2 in [15]). Let 3>, * 6 M nxn are symmetric 
matrices. Let 3> ^ Then, for every n x m matrix Z : Z T 3>Z ^ Z T \I/Z. 

We now present the new technical lemmas. 

Lemma 10. Lef A G R mx ™ (m > n) and B G M mx ^ (m > I). Define C := [A; B] G R^xfr+O, 
suppose C /zas ranfc so Uc G M mxw . Lef S G R rxm foe any matrix such that \J\ — e < o w (SU C ) < 
(7i (SUc) < VT+~e,for some < e < 1 . Then, for i = 1, . . . , min(n, I), 



Wi (a t b 



(j, ( A T S T SB) I < e - II Alio - IIB| 



2 • 



Proof. Using Weyl's inequality for the singular values of arbitrary matrices (Lemma 8) we obtain, 
\<ji (a t b) - <n (A t S t Sb) I < ||A T S T SB - A T B|| 2 

= ||v A s A (u A s T su B - u A u B ) S b V b || 2 

< ||U A S T SU B - U A U B || 2 • ||A|| 2 • ||B|| 2 . 

Next, we argue that 

||u A s T su B - u A u B || 2 < ||u£s T SU c - Iu,|| 2 . 

Indeed, we now have 

||U A S T SU B -U A U B ||2 = 



sup |w T U A S T SU B z - w'UiUflzl 

w||2=l, l|z||2 = l 

sup |x T S T Sy — x T y| 

x|| a =||y||a=l, x G ^(U A ), yG^(U B ) 



,TxtT ' 



< 
< 



sup 

x|| 2 =||y||2=i, xeTC(u c ), yerc(u B ) 



|x T S T Sy - x T y| 



x T S T Sy - x T y| 



sup 

x|| 2 = b|| 2 =l, xGTC(U c ), yeTC(U c ) 

sup |w T u£S T SU c z - w T u£u c z| 

w|j 2 = l, |jz|| 2 = l 

|u£s T SUc -I w || 2 . 



In the above, all the equalities follow by the definition of the spectral norm of a matrix while the 
two inequalities follow because 7£(U A ) C 7£(Uc) and TZ(Vb) C 7£(Uc), respectively. 

To conclude the proof, recall that we assumed that for i G [a;]: 1 — e < \ (UqS t SUc) < 1 + e. 
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Lemma 11. Let A € R mxn (m > n) and B G W nxf ~ (m > £). Let S € W xm be any matrix such that 
rank(SA) = rank(A) and ranfc(SB) = rank(B), and all singular values o/SU A and SUb are inside 
[a/1 — e, a/1 + e] /of some < e < 1/2. Then, for i = 1, . . . , min(n, £), 

la, (U A S T SU B ) - a* (u£ A U SB ) | < 2e (1 + e) . 

Proof. For every i = 1 , . . . , q we have, 

\<Ti (U A S T SU B ) - ui (ul A U SB ) I = Wi (s^VaATsTSBVbSb 1 ) 

- °i (Ss A Vl A A T S T SBV SB S SB ) I 
< 7 • (s^VAATsTSBVeSi 1 



7 ■ <7j ( U A S T SU B 



< 7-||UIS t || 2 - ( t 4 (SUb) 

< 7-(l + e) 



with 7 = max(||5]-ivl A V A SivIVsAS s i - I p || 2 , ||S- b VsbVb£ b VbVsbSs~ b " \h) ■ 

In the above, the first inequality follows using 2 Lemma 7, while the second follows because 

for any two matrices X, Y : <jj(XY) < ||X|| 2 (Xj(Y). Finally, in the third inequality we used the fact 

that ||U A S T || 2 < \/T+~i and ^ (SU B ) < y/TTe. 

We now bound ||^sA^SA^AS A V A VsASg A - I p || 2 . The second term in the max expression 

of 7 can be bounded in a similar fashion, so we omit the proof. 

II s sa v sa v a£! a V a Vsa£! sa - i p || 2 = ||£ sa v| a a t av S a£ sa - Iplb 

= ||Ul A ((SA) + ) T A T A(SA)+UsA - u! a UsaU! a U S a||2 
= ||U| A (((SA)+) T A T A(SA)+ - U SA Ul A ) U SA || 2 

= ||((SA) + ) T A T A(SA) + -U SA Ul A || 2 



where we used A T A = V A £ A V A , (SA) + U SA = Vsa^sa- Recall that, all the singular values of 
SU A are between \f\ - e and a/1 + e, so: 

(l - e)I p < U A S T SU A < (1 + e)I p . 

Conjugating the PSD ordering with S A V A (SA) + (see Lemma 9), it follows that 

(1 - e)((SA)+) T A T A(SA)+ < U SA Ul A < (l + e)((SA)+) T A T A(SA) + 

since A T A = V A S A V A and SA(SA) + = Us A U| A . Rearranging terms, it follows that 

^-LusaUIa < ((SA)+) T A T A(SA)+ < _l_u SA Ui A 

2 Set * = S^VIA^^BVbS" 1 , D l := Eg A V| A V A EA and D fi := E B V B V SB £ S] ^. Dl and D H are 
non-singular, as a product of non-singular matrices. Moreover, Dl^D/j = Eg A V SA A T S T SBV SB £g B , since 
A = AV A V A , B = BV B V B . 
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Since < e < 1/2, it holds that 3^/3 < 1 + 2e and ^ > 1 - e, hence 



-2eU SA Ul A ^ ((SA) + ) T A T A(SA) 
This implies that 



UsaU1 a ^ 2eU SA U| A 



||((SA) + ) 1 A 1 A(SA) + - Us A U4 A || 2 <2e||U SA U^ A || 2 = 2e. 

Indeed, let x + be the unit eigenvector of the symmetric matrix ((SA) + ) T A T A(SA) + — Us A U| A 
corresponding to its maximum eigenvalue. The PSD ordering implies that 

A max (((SA)+) T A T A(SA) + - U SA Ul A ) < 2exTu SA Ul A x+ < 2e||U SA Ul A || 2 = 2e. 

Similarly, A min (((SA) + ) T A T A(SA) + - U SA U| A ) > -2e, which shows the claim. ■ 
Lemma 12. Repeat the conditions of Lemma 10. Then, for all w e W 1 and y e R e , we have 

< e ■ || Aw|| 2 • || By || 2 . 

Proof. Let E = U A S T SU B - U A U B . Now, 



w T A T By - w T A T S T SBy 



w T A T By- w'A'S'SBy 



T * T c T t 



w T V A I] A EI]BV B y| 

< ||w T V A £ A || 2 ||E|| 2 ||£ B V B y|| 2 

= ||w T V A £ A U A || 2 ||E|| 2 ||U B £ B V B y|| 2 

= ||w T A T || 2 ||E|| 2 ||By|| 2 

= ||E|| 2 || Aw|| 2 ||By|| 2 



Now, Lemma 7 ensures that ||E|| 2 < e. 



4 CCA of Row Sampled Pairs 

Given A and B, one straightforward way to accelerate CCA is to sample rows uniformly from 
both matrices, and to compute the CCA of the smaller matrices. In this section we show that if we 
sample enough rows, then the canonical correlations of the sampled pair are close to the canonical 
correlations of the original pair. Furthermore, the canonical weights of the sampled pair can be 
used to find approximate canonical vectors. Not surprisingly, the sample size depends on the 
coherence. More specifically, it depends on the coherence of [A; B]. 

Theorem 13. Suppose A 6 ^ m ><™ ( m > n ) h as mn \ p and B e R mxe (m > I) has rank q < p. 
Let < e < 1/2 be an accuracy parameter and < 5 < 1 be a failure probability parameter. Let 
ui = rank([A; B]) < p + q. Let r be an integer such that 

54e~ 2 m / u([A; B]) log(12w/<5) < r < m . 

Let T be a random subset of [m] of cardinality r, drawn from a uniform distribution over such subsets, 
and let S G R rxm fa the sampling matrix corresponding to T rescaled by y/m/r. Denote A = SA and 
B = SB. 
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Let oi, . . . , a q be the exact canonical correlations of (A, B), and let 

wi = xi/||Axi|| 2 , ...,w p = x ? /||AxJ 2 , 

and 

Pi = yi/||Byi|| 2 , ...,p q = y g /||By ? || 2 

be the exact canonical weights of (A, B). With probability of at least 1 — 5 all the following hold simultane- 
ously: 

(a) (Approximation of Canonical Correlations) For every i = 1,2, ... ,q: \cii (A, B) - Oi (SA, SB) | < 
e + 2e 2 /9 = 0(e) • 

(b) (Approximate Orthonormal Bases) The vectors {A\Vi} ie ^form an approximately orthonormal basis. 
That is, for any c £ [q], 

1 9 1 

< AwJ < 



l + e/3 ~ """ c " 2 - 1 - e /3 
and for any i / j, 

| (Aw;, AWj) I < g-^. 

Similarly, for the set of {Bpi} ie ^ q y 
(c) (Approximate Correlation) For every i = 1, 2, . . . , q: 

g«(A,B) e/3 a,(A,B) e/3 

Proo/ Let C := [Ua; Ub]. Lemma 4 implies that each of the following three assertions hold with 
probability of at least 1 — 6/ 3, hence all three hold simultaneously with probability of at least 1 — 6: 



For every r £ \p\. 
For every k £ [q]: 
For every /i £ [cj]: 



V 7 ! - e/3 < a r (SU A ) < y/l + e/3. 
Vl ~ e/3 < ffi(SU B ) < \A + e/3 . 
a/1^73 < ffh(SUc) < a/1^73 • 



We now show that if indeed all three hold, then (a)-(c) hold as well. 

Proof of (a). Corollary 3 implies that cr^A, B) = o-j(U A U B ) and cr^SA, SB) = 0;(U| A U SB ). 
We now use the triangle inequality to get, 



\a t (A, B) - a,, (SA, SB) | = ^ (U A U B J - <n (ul A U SB/ 

< Wi (u A u B ) - 0j (u A s T su B ) i + \<Ti (u A s T su B ) - en (ul A u SB ) |. 

To conclude the proof, use Lemma 10 and Lemma 11 to bound these two terms, respectively. 
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Proof of (b). For any c G [q], 

||Aw c || 2 = || Aw c || 2 /|| Aw c || 2 

since || Aw c || 2 = 1. Now use Lemma 12. 
For any i ^ j 

| (Awj, Awj) | < |w l T A T Aw J -| + |w J T (A T A-A T A)w J | 

= |wJ(A T A-A T A) Wj | 

< -||AWi||2||AWj||2 

3 

ley 

< 1 l e y 3 l|Aw < || 2 ||Aw J -||2 



3-e 



In the above, we used the triangle inequality the fact that the w/s are the canonical weights of A, 
and Lemma 12. 

Proof of (c). We only prove the upper bound. The lower bound is similar, and we omit it. 



a (Aw 4 ,Bpi) = 



(Awj, Bpj) 
Awj^UBpil^ 



1 / / » n \ T ( T -w~* 1 T 



1 - e/3 



((Aw,, Bp,) + w 4 T (A T B - A T B) p< ) 



o-(Ajq,Byi) /3 

< — , — - + — • l|AwJ 2 • llBpj 

l-e/3 l-e/3 11 " 11 V 



< 



crLAwi,BpjJ e ^ 3 



l-e/3 (l-e/3) 2 



In the above, the first equality follows by the definition of cr(-, •), the first inequality by using 
1 = 1 1 A Wi 1 1 1 < (1 + e)||Awj||2 (same holds for Bpj), the second inequality from Lemma 12, the 
third inequality by using (1—e) || Awj || I < ||Awi||| = 1 (same holds for Bp^), and the last inequality 
by (a). ■ 

5 Fast Approximate CCA 

First, we define what we mean by approximate CCA. 

Definition 14 (Approximate CCA). For < rj < 1, an ^-approximate CCA of (A, B), is a set of 

positive numbers a±, . . . , a q together with a set of vectors wi, . . . , w q for A and a set of vectors pi, . . . , p q 
for B, such that 

(a) For every i £ [q], 

|<7i(A,B) - &i\ < rj. 
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Algorithm 1 Fast Approximate CCA 



1: Input: A G R mxn of rank p,B £ R mx£ of rank q, < e < 1/2, and 5 (n > I, p > q) 

! — min(54e~ 2 



Vn + £+ y / 81og(12m/(5) log(3(n + i)/S), m) 
Let S be the sampling matrix of a random subset of [m] of cardinality r (uniform distribution). 
Draw a random diagonal matrix D of size m with ±1 on its diagonal with equal probability. 
A < — SH • (DA) using fast subsampled WHT (see Section 2.3). 
B < — SH (DB) using fast subsampled WHT (see Section 2.3). 

Compute and return the canonical correlations and the canonical weights of (A, B) (e.g. using 
Bjorck and Golub's algorithm). 



|AWj||l - 1| < Tj. 



(b) For every i € [q], 
and for i / j, 

| (Awj, Awj) \<rj. 

Similarly, for the set of {Bpi} ie ^ q y 

(c) For every i £ [q], 

|<7i(A,B) -<7(Awi,Bpi)| < rj. 

We are now ready to present our fast algorithm for approximate CCA of a pair of tall-and-thin 
matrices. Algorithm 1 gives the pseudo-code description of our algorithm. 

The analysis in the previous section (Theorem 13) shows that if we sample enough rows, 
the canonical correlations and weights of the sampled matrices are an 0(e)-approximate CCA 
of (A, B). However, to turn this observation into a concrete algorithm we need an upper bound 
on the coherence of [A; B]. It is conceivable that in certain scenarios such an upper bound might 
be known in advance, or that it can be computed quickly [10]. However, even if we know the 
coherence, it might be as large as one, which will imply that sampling the entire matrix is needed. 

To circumvent this problem, our algorithm uses the RHT to reduce the coherence of the matrix 
pair before sampling rows from it. That is, instead of sampling rows from (A, B) we sample rows 
from (0A, @B), where is a RHT matrix (Definition 5). This unitary transformation bounds 
the coherence with high probability, so we can use Theorem 13 to compute the number of rows 
required for an 0(e)-approximate CCA. We now sample the transformed pair (0A, 0B) to obtain 
(A, B). Now the canonical correlations and weights of (A, B) are computed and returned. 

Theorem 15. With probability of at least 1 — 5, Algorithm 1 returns an O(e)-approximate CCA of (A, B). 
Assuming Bjorck and Golub's algorithm is used in line 7, Algorithm 1 runs in time 

O ^mn log m + e~ 2 \fn + \J\og(m/ 5) \og(n/5)n 2 ^j . 

Proof. Lemma 6 ensures that with probability of at least 1 — 5/2, 

p{[@A;&B}) < — (v / ^T^+ \/81og(3m/(5)) 2 . 
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Assuming that the last inequality holds, Theorem 13 ensures that with probability of at least 
1 — 5/2, the canonical correlations and weights of (A, B) form an 0(e)-approximate CCA of 
(0A, @B). By the union bound, both events hold together with probability of at least 1 — 5. 
The RHT transforms applied to A and B are unitary, so for every -q, an 77-approximate CCA of 
(0A, @B) is also an ^-approximate CCA of (A, B) (and vice versa). 

Running time analysis. Step 2 takes 0(1) operations. Step 3 requires O(r) operations. Step 4 
requires 0{m) operations. Step 5 involves the multiplication of A with SHD from the left. Com- 
puting DA requires 0(mn) time. Multiplying SH by DA using fast subsampled WHT requires 
0(mn log r) time, as explained in Section 2.3. Similarly, step 6 requires 0(m£ log r) operations. 
Finally, step 7 takes 0(rnl + r(n 2 + £ 2 )) time. Assuming that n > I, the total running time is 
0{rn 2 + mnlog(r)). Plugging the value for r, and using the fact that r < m, established our 
running time bound. 

■ 

From a practical point of view, our algorithm is useful for measuring the size of the correlated 
subspace, and obtaining the principal vectors of it. e is 0.1, or perhaps 0.01. So for reasonably high 
correlations, say above 0.2, we get some useful information. However, for lower correlations we 
get no information at all. Furthermore, it is too expensive to compute all the principal vectors, but 
once we know the size of the correlated subspace we can use the approximate weights to compute 
the vectors for that subspace. 

6 Relative vs. Additive Error 

Now, we demonstrate that, unless r ps m, it is not possible to replace the additive error guarantees 
of Theorem 15 with relative error guarantees. 

Lemma 16. Assume that given any matrix pair (A, B) and any constant < e < 1, Algorithm 1 computes 
a pair (A, B) by setting a sufficient large value for r in Step 2 so that the canonical correlations are relatively 
preserved with constant probability, i.e., with constant probability: 

(1 - e)<7i(A, B) < <Ti(A, B) < (1 + e)<7i(A, B) 

i = l,...,q. 

Then, it follows that r = fi(m/log(m)). 

Proof. The proof follows by a reduction to the set disjointness communication complexity prob- 
lem. That is, assume that Alice gets an x G {0, 1}™ as input and Bob gets y G {0, l} m . Their goal 
is to decide if there exists i G [m] so that Xi = m = 1 by exchanging as less information as possible. 
It is known that the randomized communication complexity of this problem is Q(m), see [3] for a 
modern proof. 

Set e = 1/2 and let S be a constant in Algorithm 1. Now, Alice and Bob can compute x = 
\/mSHDx and y = y^SHDy, respectively (using shared randomness). Then, Alice sends to Bob 
x. With constant probabilty it holds 

1 (x, y) < 1 g y) < 3 (x, y) ^ 

2||x||||y|| — r||x||||y|| — 2 ||x||||y|| 
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Now, Bob can decide if there exists i, so that Xi = yi = 1 by checking if (x, y) is zero or not. 
Hence, this protocol decides the set disjointness problem. Now, since ^/mSHD is an r x m matrix 
with entries from {—1, +1}, it follows that || xjloo < m. Therefore, we can encode x using at most 
r log(2m) bits. It follows by the linear lower bound for set disjointness that r log(2m) > Cm for 



7 Experiments 

We now report the results of a few small-scale experiments. Our experiments are not meant to be 
exhaustive; however, they do show that our algorithm can be modified slightly to achieve very 
good performance in practice while still producing acceptable errors. 

Our implementation of Algorithm 1 differs from the pseudo-code description in two ways. 
First, we use 



for setting the sample size, i.e. we keep the same asymptotic behavior, but drop the constants. The 
constants in Algorithm 1 are rather large, so they preclude the possibility of beating Bjorck and 
Golub's algorithm for reasonable matrix sizes. Our implementation also differs in the choice of 
underlying mixing matrix. Algorithm 1, and the analysis, uses the WHT However, it is possible 
to show that other Fourier-type transforms will work as well (the bounds remain unchanged), and 
that some of these alternative transforms have certain advantages that make them better suited 
for an actual implementation [2]. Specifically, we use the implementation of randomized Discrete 
Hartley Transform in the Blendenpik library 3 . 

We report the results of three experiments. In each experiment we run our code five times on 
a fixed pair of pair of matrices (datasets) A and B, and compared the different outputs to the true 
canonical correlations. The first two experiments involved synthetic data-sets, for which we set 
e = 0.25 and 5 = 0.05. The last experiment was conducted on a real-life dataset, and we used 
e = 0.5 and 5 = 0.2. All experiments were conducted in a 64-bit version of MATLAB 7.8. We used 
a two quad-core Intel E5410 computer running at 2.33 GHz, with 32GB DDR2 800 MHz RAM, 
running Linux 2.6, but we use a single core only. 

Synthetic Experiment 1. In this experiment we first draw five random matrices: three matrices 
G, W, Z G W nxn with independent entries from the normal distribution, and two matrices X, Y € 
i nxn with independent entries from the uniform distribution on [0, 1]. We now set A = GX + 0.1- 
W and B = GY + 0.1 • Z. We use the sizes m = 120, 000 and n = 60. Conceptually, we first take a 
random basis (the columns of G), and linearly transform it in two different ways (by multiplying 
by X and Y). The transformation does not change the space spanned by the bases. We now add to 
each base some random noise (0.1 • W and 0.1 • Z). Since both A and B essentially span the same 
column space, only polluted by different noise, we expect (A, B) to have mostly large canonical 
correlations (close to 1), but also a few small ones. Indeed, Figure 1(a), which plots the canonical 
correlations of this pair, shows that this is the case. 

Figure 2(a) shows the (signed) error in approximating the correlations, in five different runs. 
The actual error is always an order of magnitude smaller than the input e; the maximum absolute 
error is only 0.011. For large canonical correlations the error is much smaller, and the approxi- 
mated value is very accurate. For smaller correlations, the error starts to get larger, but it is still an 

3 Available at http:/ /www.mathworks.com/matlabcentral/fileexchange/25241-blendenpik. 



some constant C > 0. 
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Figure 1: The exact canonical correlations. 
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Figure 2: Error in approximation of the canonical correlations. 



order of magnitude smaller than the actual value for the smallest correlation. As for the running 
time, the proposed algorithm takes about 40% less time than Bjorck and Golub's algorithm (3 sec 
vs. 5 sec). 

Synthetic Experiment 2. In this experiment we first draw three random matrices. The first matrix, 
X E W nxn has independent entries from the normal distribution. The second matrix Y G W nxk 
has independent entries which take value ±1 with equal probability, and the third matrix Z E 
M fcxn has independent entries from the uniform distribution on [0, 1]. We now set A = X + 0.1 • 
Y • (lfcxn + Z) and B = Y, where lkxn is the k x n all-ones matrix. We use the sizes m = 80, 000, 
n = 80 and k = 60. Here we basically have noise (B) and a matrix polluted with that noise (A). So 
there is some correlation, but really the two subspaces are different; there is one large correlation 
(almost 1) and all the rest are small (Figure 1(b)). 

Figure 2(b) shows the (signed) error in approximating the correlations, in five different runs. 
The actual error is an order of magnitude smaller than the target e; the maximum absolute error 
is only 0.02. Again, for the largest canonical correlation (which is close to 1) the result is very 
accurate, with tiny errors. For the other correlations it is larger. For tiny correlations the error 
is about the same magnitude as the actual value. Interestingly, we observe a bias towards over- 
estimating the correlations. As for the running time, the proposed algorithm takes about 30% less 
time than Bjorck and Golub's algorithm (3.1 sec vs. 4.5 sec). 

Real-life dataset: Mediamill. We also tested the proposed algorithm on the annotated video 
dataset from the Mediamill Challenge [21] 4 . Combining the training set and the challenge set, 
43907 images are provided, each image is a representative keyframe image of a video shot. The 
dataset provides 120 features for each image, and the set is annotated with 101 labels. Figure 1(c) 
shows the exact canonical correlations. We see there is a few high correlations, with very strong 
decay afterwards. 

4 The dataset is publicly available at http: / / www.csie.ntu.edu.tw/ cjlin/libsvmtools / datasets 
/ multilabel.html##mediamill. 
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Figure 2(c) shows the (signed) error in approximating the correlations, in five different runs. 
The maximum absolute error is rather small (only 0.055). For the large correlations, which are the 
more interesting ones in this context, the error is much smaller, so we have a relatively high accu- 
racy approximation. Again, there is an interesting bias towards over-estimating the correlations. 
As for the running time, the proposed algorithm is considerably faster than Bjorck and Golub's 
algorithm (2.05 sec vs. 5.84 sec). 

Summary. The experiments are not exhaustive, but they do suggest the following. First, it appears 
that the sampling size bounds are rather loose. The algorithm achieves much better approximation 
errors. Second, there seems to be a connection between the canonical correlation value and the 
error: for larger correlations the error is smaller. Our bounds fail to capture these phenomena. 
Finally, the experiments show that the proposed is faster than Bjorck and Golub's algorithm in 
practice on both synthetic and real-life datasets, even if they are fairly small. 
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